Written Texts as Statistical Mechanical Problem
Abstract
In this article we present a model of human-written text based on statistical-mechanical considerations. The empirical derivation of the potential energy for the parts of the text and the calculation of the thermodynamic parameters of the system show that the “specific heat” corresponds to the semantic classification of the words in the text, separating keywords, function words, and common words. This correspondence can be an advantage when the model is used in text-search mechanisms.
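The abstract does not give the form of the potential energy or the exact definition of the “specific heat”, so the following is only an illustrative sketch, not the authors' method. It assumes a Boltzmann-like energy E_w = -ln(p_w) taken from relative word frequencies, and uses the standard canonical-ensemble fluctuation formula C = (<E^2> - <E>^2) / T^2 with k_B = 1. The function names `word_energies` and `specific_heat` are hypothetical, and the sketch yields a single value for the whole text rather than the per-word classification the abstract describes.

```python
import math
from collections import Counter


def word_energies(tokens):
    """Assign each distinct word an ASSUMED energy E_w = -ln(p_w).

    p_w is the relative frequency of the word in the token list.
    This energy form is an illustration; the abstract does not specify one.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: -math.log(count / total) for word, count in counts.items()}


def specific_heat(energies, temperature=1.0):
    """Canonical-ensemble specific heat over the word energy levels.

    Uses C = (<E^2> - <E>^2) / T^2 with k_B = 1, where the averages are
    taken with Boltzmann weights exp(-E / T) over the distinct words.
    """
    levels = list(energies.values())
    weights = [math.exp(-e / temperature) for e in levels]
    partition = sum(weights)
    mean_e = sum(e * w for e, w in zip(levels, weights)) / partition
    mean_e2 = sum(e * e * w for e, w in zip(levels, weights)) / partition
    return (mean_e2 - mean_e ** 2) / temperature ** 2


if __name__ == "__main__":
    tokens = "the cat sat on the mat the end".split()
    energies = word_energies(tokens)
    print(energies)
    print(specific_heat(energies, temperature=1.0))
```

Under these assumptions, frequent words receive low energy and rare words high energy, and the specific heat measures how widely the energies spread at a given temperature.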
Similar resources
A Phrase-Based Statistical Model for SMS Text Normalization
Short Messaging Service (SMS) texts behave quite differently from normal written texts and have some very special phenomena. To translate SMS texts, traditional approaches model such irregularities directly in Machine Translation (MT). However, such approaches suffer from a customization problem, as tremendous effort is required to adapt the language model of the existing translation system to han...
Balanced Corpus of Contemporary Written Japanese
Construction of a 100-million-word balanced corpus of contemporary written Japanese is underway at the National Institute for Japanese Language. The unique property of the corpus is that the majority of its sample texts are selected randomly from well-defined statistical populations covering a wide range of written texts.
ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT
We present ParCor, a parallel corpus of texts in which pronoun coreference – reduced coreference in which pronouns are used as referring expressions – has been annotated. The corpus is intended to be used both as a resource from which to learn systematic differences in pronoun use between languages and ultimately for developing and testing informed Statistical Machine Translation systems aimed ...
Predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks
The lack of a sufficient amount of data tailored for a task is a well-recognized problem for many statistical NLP methods. In this paper, we explore whether data sparsity can be successfully tackled when classifying language proficiency levels in the domain of learner-written output texts. We aim at overcoming data sparsity by incorporating knowledge in the trained model from another domain con...
Statistical Techniques for Text Classification Based on Word Recurrence Intervals
The decision as to whether two texts were written by the same author is usually a difficult one. Can an analysis of how the words in a text statistically cluster shed some light on authorship? In this paper we examine both English texts and the Greek source texts of the New Testament. The mathematical techniques developed by Shannon [1,2] and Markov have been used for a number of years to analys...